Abstract

We consider the problem of sequential sampling from a finite num- 
ber of independent statistical populations to maximize the expected 
infinite horizon average outcome per period, under a constraint that 
the expected average sampling cost does not exceed an upper bound. 
The outcome distributions are not known. We construct a class of con- 
sistent adaptive policies, under which the average outcome converges 
with probability 1 to the true value under complete information for all 
distributions with finite means. We also compare the rate of conver- 
gence for various policies in this class using simulation. 



1 Introduction 

In this paper we consider the problem of sequential sampling from k
independent statistical populations with unknown distributions. The
objective is to maximize the expected outcome per period achieved over an
infinite horizon, under a constraint that the expected sampling cost per
period does not exceed an upper bound. The sampling cost adds a new
dimension to the standard tradeoff between experimentation and profit
maximization faced in problems of control under incomplete information.
The sampling cost may prohibit using populations with high mean out- 
comes because their sampling cost may be too high. Instead, the decision 
maker must identify the subset of populations with the best combination 
of outcome versus cost and allocate the sampling effort among them in an 
optimal manner. 



From the mathematical point of view, this class of problems incorpo- 
rates statistical methodologies into mathematical programming problems. 
Indeed, under complete information, the problem of effort allocation under
cost constraints is typically formulated in terms of linear or nonlinear
programming. However, when some of the problem parameters are not known in
advance but must be estimated by experimentation, the decision maker must
design adaptive learning and control policies that ensure learning about the
parameters while keeping the profit sacrificed for the learning process as
low as possible.

The model in this paper falls in the general area of multi-armed bandit
problems, which was initiated by Robbins (1952), who proposed a simple
adaptive policy for sequentially sampling from two unknown populations in
order to maximize the expected outcome per unit time over an infinite
horizon. Lai and Robbins (1985) generalize the results by constructing
asymptotically efficient adaptive policies with optimal convergence rate of
the average outcome to the optimal value under complete information, and
show that the finite horizon loss due to incomplete information increases
at a logarithmic rate. Katehakis and Robbins (1995) prove that simpler
index-based efficient policies exist in the case of normal distributions
with unknown means, while Burnetas and Katehakis (1996) extend the results
on efficient policies to the nonparametric case of discrete distributions
with known support.

In a finite horizon setting, Kulkarni and Lugosi (2000) develop a minimax
version of the Lai and Robbins (1985) results for two populations, while
Auer et al. (2002) construct policies which also achieve logarithmic regret
uniformly over time, rather than only asymptotically.

In all works mentioned above there is no side constraint in sampling.
Problems with adaptive sampling and side constraints are scarce in the
literature. Wang (1991) considers a multi-armed bandit model with
constraints and adopts a Bayesian formulation and the Gittins-index approach.
The paper proposes several heuristic policies. Pezeshk and Gittins (1999) 
also consider the problem of estimating the distribution of a single popu- 
lation with sampling cost under the assumption that the number of users 
who will benefit from the depends on the outcome of the estimation. Fi- 
nally, Madani et al. (2004) present computational complexity analysis for 
a version of the multi-armed bandit problem with Bernoulli outcomes and 
Beta priors, where there is a total budget for experimentation, which must 
be allocated to sampling from the different populations. 

Another approach, which is closer to the one we adopt here, is to consider
the family of stochastic approximation and reinforcement learning
algorithms. The general idea is to select the sampled population following
a randomized policy with randomization probabilities that are adaptively
modified after observing the outcome in each period. The adaptive scheme
is based on the stochastic approximation algorithm. Algorithms of this type
are analyzed in Poznyak et al. (2000) for the more general case where the
population outcomes have Markovian dynamics instead of being i.i.d.

The contribution of this paper is the construction of a family of policies
for which the average outcome per period converges to the optimal value
under complete information for all distributions of individual populations
with finite means. In this sense, it generalizes the results of Robbins (1952)
by including a sampling cost constraint. The paper is organized as follows.
In Section 2, we describe the model in the complete and incomplete
information framework. In Section 3, we construct a class of adaptive sampling
policies and prove that it is consistent. In Section 4, we explore the rate of
convergence of the proposed policies using simulation. Section 5 concludes.



2 Model description 

Consider the following problem in adaptive sampling. There are $k$
independent statistical populations, $i = 1, \ldots, k$. Successive samples
from population $i$ constitute a sequence of i.i.d. random variables
$X_{i1}, X_{i2}, \ldots$ following a univariate distribution with density
$f_i(\cdot)$ with respect to a nondegenerate measure $\nu$. Then the
stochastic model is uniquely determined by the vector
$f = (f_1, \ldots, f_k)$ of individual pdf's. Given $f$, let $\mu(f)$ be the
vector of expected values, i.e. $\mu_i(f) = E^{f_i}(X_i)$. The form of $f$
is not known. In each period the experimenter must select a population to
obtain a single sample from. Sampling from population $i$ incurs cost $c_i$
per sample and without loss of generality we assume
$c_1 \le c_2 \le \ldots \le c_k$, but not all equal. The objective is to
maximize the expected average reward per period subject to the constraint
that the expected average sampling cost per period over infinite horizon
does not exceed a given upper bound $C_0$. Without loss of generality we
assume $c_1 \le C_0 < c_k$. Indeed if $C_0 < c_1$ then the problem is
infeasible. On the other hand if $C_0 \ge c_k$ then the cost constraint is
redundant. Let $d = \max\{j : c_j \le C_0\}$. Then $1 \le d < k$ and
$c_d \le C_0 < c_{d+1}$.
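The normalization above can be checked mechanically. The following is a minimal Python sketch, not part of the original paper: the function name `budget_index` is hypothetical, and the weak inequalities are assumed to be the ones intended in the definition of $d$.

```python
from typing import Sequence


def budget_index(costs: Sequence[float], c0: float) -> int:
    """Return d = max{j : c_j <= C_0} (1-based) for nondecreasing costs."""
    if list(costs) != sorted(costs):
        raise ValueError("costs must be sorted in nondecreasing order")
    if c0 < costs[0]:
        raise ValueError("infeasible problem: C_0 < c_1")
    if c0 >= costs[-1]:
        raise ValueError("cost constraint is redundant: C_0 >= c_k")
    return max(j for j, c in enumerate(costs, start=1) if c <= c0)


# Cost vector and budget used later in the simulation study of Section 4.
print(budget_index([3, 4, 8, 10], 5))  # prints 2
```

With $c = (3, 4, 8, 10)$ and $C_0 = 5$, populations 1 and 2 are individually affordable and populations 3 and 4 are not, so $d = 2$.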

2.1 Complete information framework 

We first analyze the complete information problem. If all $f_j(\cdot)$ are
known, then the problem can be modeled via linear programming. Consider a
randomized sampling policy which at each period selects population $j$ with
probability $x_j$, for $j = 1, \ldots, k$. To find a policy that maximizes
the expected reward, we can formulate the following linear program in
standard form

\[
\begin{aligned}
z^* = \max\ & \sum_{j=1}^{k} \mu_j x_j \\
& \sum_{j=1}^{k} c_j x_j + y = C_0 \\
& \sum_{j=1}^{k} x_j = 1 \\
& x_j \ge 0, \ \forall j, \quad y \ge 0.
\end{aligned}
\tag{1}
\]

Note that $z^*$ depends on $f$ only through the vector $\mu(f)$, i.e. $z^*$
is the same for all collections of pdf's with the same $\mu$. Therefore in
the remainder we will denote $z^*$ as a function of the unknown mean vector
$\mu$.

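In the analysis we will also use the dual linear program (DLP) of (1),

\[
\begin{aligned}
z^*_D = \min\ & g + C_0 \lambda \\
& g + c_1 \lambda \ge \mu_1 \\
& \qquad \vdots \\
& g + c_k \lambda \ge \mu_k \\
& g \in \mathbb{R}, \ \lambda \ge 0,
\end{aligned}
\]

with two variables $\lambda$ and $g$ which correspond to the first and second
constraints of (1), respectively.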

The basic matrix $B$ corresponding to a Basic Feasible Solution (BFS) of
problem (1) may take one of two forms.

In the first case, the basic variables are $x_i, x_j$ for two populations
with $c_i \le C_0 \le c_j$, $c_i < c_j$, and the basic matrix is

\[ B = \begin{pmatrix} c_i & c_j \\ 1 & 1 \end{pmatrix}. \]

The BFS is then

\[ x_i = \frac{c_j - C_0}{c_j - c_i}, \quad x_j = \frac{C_0 - c_i}{c_j - c_i}, \quad x_m = 0 \ \text{for } m \ne i, j, \quad y = 0, \]

with

\[ z(x) = \mu_i \frac{c_j - C_0}{c_j - c_i} + \mu_j \frac{C_0 - c_i}{c_j - c_i}. \]

The solution is nondegenerate when $c_i < C_0 < c_j$ and degenerate when
$C_0 = c_i$ or $C_0 = c_j$. In the latter case, it corresponds to sampling
from a single population $l = i$ or $l = j$, respectively:

\[ x_l = 1, \quad x_m = 0 \ \forall m \ne l, \quad y = 0, \]

with

\[ z(x) = \mu_l. \]

The second case of a BFS corresponds to basic variables $x_i, y$ for a
population $i$ with $c_i \le C_0$. The basic matrix is

\[ B = \begin{pmatrix} c_i & 1 \\ 1 & 0 \end{pmatrix}. \]

In this case the BFS corresponds to sampling from population $i$ only:

\[ x_i = 1, \quad x_m = 0 \ \forall m \ne i, \quad y = C_0 - c_i, \]

with

\[ z(x) = \mu_i. \]

The solution is nondegenerate if $c_i < C_0$, otherwise it is degenerate.

From the above it follows that a BFS is degenerate if $x_l = 1$ for some
$l$ with $c_l = C_0$. Any basic matrix $B$ that includes $x_l$ as a basic
variable corresponds to this BFS.

For a BFS $x$ let

\[ b = \{i : x_i > 0\}. \]

Then, either $b = \{i, j\}$ for some $i, j$ with $i \le d < j$, or
$b = \{i\}$ for some $i \le d$. There is a one to one correspondence between
basic feasible solutions and sets $b$ of this form. We use $K$ to denote the
set of BFS, or equivalently

\[ K = \{b : b = \{i, j\},\ i \le d < j \ \text{ or } \ b = \{i\},\ i \le d\}. \]

Since the feasible region of (1) is bounded, $K$ is finite.
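Because $K$ is finite and has the explicit form above, the complete information problem can be solved by direct enumeration. The following Python sketch illustrates this under stated assumptions: the names `enumerate_K` and `solve_by_enumeration` are hypothetical, populations are indexed from 0, and enumeration stands in for a general LP solver.

```python
from typing import List, Sequence, Tuple

BFS = Tuple[Tuple[int, ...], List[float]]  # (support b, probabilities x)


def enumerate_K(costs: Sequence[float], c0: float) -> List[BFS]:
    """Enumerate the BFS set K of LP (1); populations are 0-based here."""
    k = len(costs)
    low = [i for i in range(k) if costs[i] <= c0]    # indices i <= d
    high = [j for j in range(k) if costs[j] > c0]    # indices j > d
    result: List[BFS] = []
    for i in low:                                    # b = {i}: sample i only
        x = [0.0] * k
        x[i] = 1.0
        result.append(((i,), x))
    for i in low:                                    # b = {i, j}: budget exhausted
        if costs[i] == c0:
            continue                                 # degenerate, same BFS as b = {i}
        for j in high:
            x = [0.0] * k
            x[j] = (c0 - costs[i]) / (costs[j] - costs[i])
            x[i] = 1.0 - x[j]
            result.append(((i, j), x))
    return result


def solve_by_enumeration(
    mu: Sequence[float], costs: Sequence[float], c0: float
) -> Tuple[float, Tuple[int, ...], List[float]]:
    """Return (z*, b, x) for one optimal BFS of LP (1)."""
    def value(entry: BFS) -> float:
        return sum(m * xi for m, xi in zip(mu, entry[1]))

    b, x = max(enumerate_K(costs, c0), key=value)
    return value((b, x)), b, x


# Mean and cost vectors of the simulation study in Section 4.
z, b, x = solve_by_enumeration([1.5, 2.5, 4.5, 4.0], [3, 4, 8, 10], 5)
print(z, b, x)
```

For the data of Section 4 the set $K$ has six elements, and the enumeration selects the pair of populations 2 and 3 (indices 1 and 2 in the 0-based code) with value 3, in agreement with the optimal solution reported there.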

For a basic matrix $B$, let $v^B = (\lambda^B, g^B)$ denote the dual vector
corresponding to $B$, i.e., $v^B = \mu_B B^{-1}$, where
$\mu_B = (\mu_i, \mu_j)$ or $\mu_B = (\mu_i, 0)$, depending on the form of
$B$.

Regarding optimality, a BFS is optimal if and only if for at least one
corresponding basic matrix $B$ the reduced costs (dual slacks) are all
nonnegative:

\[ \phi^B_a = c_a \lambda^B + g^B - \mu_a \ge 0, \quad a = 1, \ldots, k. \]

A basic matrix $B$ satisfying this condition is optimal. Note that if an
optimal BFS is degenerate, then not all basic matrices corresponding to it
are necessarily optimal.

It is easy to show that the reduced costs can be expressed as linear
combinations $\phi^B_a = w^B_a \mu$, where $w^B_a$ is an appropriately
defined vector that does not depend on $\mu$.
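The dual vector and the reduced costs can be computed in closed form for both types of basic matrix. The Python sketch below is an illustration under stated assumptions: the function name `dual_and_reduced_costs` is hypothetical and populations are indexed from 0. It solves $v^B B = \mu_B$ directly for each form of $B$.

```python
from typing import List, Sequence, Tuple


def dual_and_reduced_costs(
    mu: Sequence[float], costs: Sequence[float], basis: Tuple[int, ...]
) -> Tuple[float, float, List[float]]:
    """Return (lam, g, phi) for a basis given as (i, j) or (i,), 0-based.

    phi[a] = c_a * lam + g - mu_a is the reduced cost of population a.
    """
    if len(basis) == 2:
        i, j = basis
        # Basic variables x_i, x_j: lam*c_i + g = mu_i and lam*c_j + g = mu_j.
        lam = (mu[j] - mu[i]) / (costs[j] - costs[i])
        g = mu[i] - lam * costs[i]
    else:
        (i,) = basis
        # Basic variables x_i, y: lam*c_i + g = mu_i and lam = 0.
        lam = 0.0
        g = mu[i]
    phi = [c * lam + g - m for c, m in zip(costs, mu)]
    return lam, g, phi


# Basis {2, 3} (0-based (1, 2)) for the data of Section 4.
lam, g, phi = dual_and_reduced_costs([1.5, 2.5, 4.5, 4.0], [3, 4, 8, 10], (1, 2))
print(lam, g, phi)
```

In this example all reduced costs are nonnegative and the dual objective $g + C_0 \lambda$ equals 3, the primal optimal value. The sketch also returns $\lambda$ so that the sign condition $\lambda \ge 0$ of the DLP can be inspected alongside the reduced costs.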

We finally define the set of optimal solutions of (1) for a given $\mu$,

\[ s(\mu) = \{b \in K : b \text{ corresponds to an optimal BFS}\}. \]

An optimal solution of (1) specifies randomization probabilities that
guarantee maximization of the average reward subject to the cost constraint.
Note that an alternative way to implement the optimal solution, without
randomization, is to sample periodically from all populations so that the
proportion of samples from each population $j$ is equal to $x_j$. This
characterization of a policy is valid if randomization probabilities are
rational.
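The periodic implementation for rational probabilities can be sketched as follows. This is an illustration only: the function name `periodic_schedule` is hypothetical, populations are indexed from 0, and the order of populations within a cycle is an arbitrary choice made here.

```python
from fractions import Fraction
from functools import reduce
from math import gcd
from typing import List, Sequence


def periodic_schedule(x: Sequence[Fraction]) -> List[int]:
    """One cycle of population indices (0-based) whose frequencies equal x."""
    if sum(x) != 1 or any(p < 0 for p in x):
        raise ValueError("x must be a probability vector")
    # Cycle length: least common multiple of the denominators.
    period = reduce(lambda a, b: a * b // gcd(a, b), (p.denominator for p in x), 1)
    cycle: List[int] = []
    for j, p in enumerate(x):
        cycle.extend([j] * int(p * period))
    return cycle


x = [Fraction(0), Fraction(3, 4), Fraction(1, 4), Fraction(0)]
cycle = periodic_schedule(x)
costs = [3, 4, 8, 10]
print(cycle, sum(costs[j] for j in cycle) / len(cycle))  # prints [1, 1, 1, 2] 5.0
```

Repeating the cycle indefinitely samples each population $j$ in proportion $x_j$, so the long-run average cost equals $\sum_j c_j x_j$.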

2.2 Incomplete information framework 

In this paper we assume that the population distributions are unknown. 
Specifically we make the following assumption. 

Assumption 1 The outcome distributions are independent, and the expected
values $\mu_a = E(X_a) < \infty$, $a = 1, \ldots, k$.

Let $F$ be the set of all $f = (f_1, \ldots, f_k)$ which satisfy A.1. Class
$F$ is the effective parameter set in the incomplete information framework.
Under incomplete information, a policy such as that in Section 2.1, which
depends on the actual value of $\mu$, is not admissible. Instead we restrict
our attention to the class of adaptive policies, which depend only on the
past observations of selections and outcomes.

Specifically, let $A_t, X_t$, $t = 1, 2, \ldots$ denote the population
selected and the observed outcome at period $t$. Let
$h_t = (a_1, x_1, \ldots, a_{t-1}, x_{t-1})$ be the history of actions and
observations available at period $t$.

An adaptive policy is defined as a sequence $\pi = (\pi_1, \pi_2, \ldots)$
of history dependent probability distributions on $\{1, \ldots, k\}$, such
that

\[ \pi_t(j, h_t) = P(A_t = j \mid h_t). \]



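Given the history $h_n$, let $T_n(a)$ denote the number of times population
$a$ has been sampled during the first $n$ periods

\[ T_n(a) = \sum_{t=1}^{n} 1\{A_t = a\}. \]

Let $S^{\pi}_n$ be the total reward up to period $n$:

\[ S^{\pi}_n = \sum_{t=1}^{n} X_t, \]

and $C^{\pi}_n$ be the total cost up to period $n$:

\[ C^{\pi}_n = \sum_{t=1}^{n} c_{A_t}. \]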

These quantities can be used to define the desirable properties of an 
adaptive policy, namely feasibility and consistency. 
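The three quantities are simple running sums over the observed history. A minimal Python sketch follows; the function name `running_statistics` and the representation of the history as a list of (population, outcome) pairs with 0-based populations are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def running_statistics(
    history: Sequence[Tuple[int, float]], costs: Sequence[float]
) -> Tuple[List[int], float, float]:
    """Return (T, S, C): sample counts, total reward and total cost."""
    T = [0] * len(costs)
    S = 0.0
    C = 0.0
    for a, x in history:
        T[a] += 1          # one more sample from population a
        S += x             # observed outcome
        C += costs[a]      # sampling cost of population a
    return T, S, C


history = [(1, 3.0), (2, 5.0), (1, 2.0), (1, 2.0)]
T, S, C = running_statistics(history, [3, 4, 8, 10])
n = len(history)
print(T, S / n, C / n)  # prints [0, 3, 1, 0] 3.0 5.0
```

The ratios `S / n` and `C / n` are the average outcome and average cost per period whose limits appear in the definitions of consistency and feasibility.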

Definition 1 A policy $\pi$ is called feasible if

\[ \limsup_{n \to \infty} \frac{E^{\pi}(C^{\pi}_n)}{n} \le C_0, \quad \forall f \in F. \tag{2} \]

Definition 2 A policy $\pi$ is called consistent if it is feasible and

\[ \lim_{n \to \infty} \frac{S^{\pi}_n}{n} = z^*(\mu), \quad \text{a.s.}, \ \forall f \in F. \]

Let $\Pi^F$ and $\Pi^C$ denote the class of feasible and consistent
policies, respectively. The above properties are reasonable requirements for
an adaptive policy. The first ensures that the long-run average sampling
cost does not exceed the budget. The second definition means that the
long-run average outcome per period achieved by $\pi$ converges with
probability one to the optimal expected value that could be achieved under
full information, for all possible population distributions satisfying A.1.

Note that consistency as defined in Definition 2 is equivalent to the 
notion of strong consistency of an estimator function. 
3 Construction of a consistent policy



A key question in the incomplete information framework is whether feasi- 
ble and, more importantly, consistent policies exist and how they can be 
constructed. 

It is very easy to show that feasible policies exist, since the sampling
costs are known. Indeed any randomized policy, such as those defined in
Section 2.1, with randomization probabilities satisfying the constraints of
LP (1) is feasible for any distribution $f$. Thus, $\Pi^F \ne \emptyset$.

On the other hand, the construction of consistent policies is not trivial. 
A consistent policy must accomplish three goals: First to be feasible, second 
to be able to estimate the mean outcomes from all populations, and third, 
in the long-run, to sample from the nonoptimal populations rarely enough 
so as not to affect the average profit. 

In this section we establish the existence of a class of consistent policies.
The construction follows the main idea of Robbins (1952), based on sparse
sequences, which is adapted to ensure feasibility.

We start with some definitions. For any population $j$, let
$\hat{\mu}_{j,t}$, $t = 1, 2, \ldots$ be a strongly consistent estimator of
$\mu_j$, i.e. $\lim_{t \to \infty} \hat{\mu}_{j,t} = \mu_j$ a.s.-$f_j$. Such
estimators exist; for example from Assumption 1, the sample mean
$\bar{X}_{j,t} = \frac{1}{t} \sum_{s=1}^{t} X_{j,s}$ is strongly consistent.

For any $n$, let $\hat{\mu}^n = (\hat{\mu}_{j,T_n(j)},\ j = 1, \ldots, k)$
be the vector of estimates of $\mu$ based on the history up to period $n$.
Also let $\hat{z}_n = z^*(\hat{\mu}^n)$ denote the optimal value of the
linear program in (1) where the estimates $\hat{\mu}^n$ are used in place of
the unknown mean vector $\mu$ in the objective. $\hat{z}_n$ will be referred
to as the Certainty-Equivalence LP. Note that $s(\hat{\mu}^n)$ is the set of
optimal BFS of $\hat{z}_n$.

The solution of $\hat{z}_n$ corresponds to a sampling policy determined by
an optimal vector $\hat{x}^n$, so that $\hat{z}_n = \hat{\mu}^n \hat{x}^n$.

We next define a class of sampling policies, which we will show to be
consistent. Consider $k$ nonoverlapping sparse sequences of positive
integers,

\[ \tau_j = \{\tau_{j,m},\ m = 1, 2, \ldots\}, \quad j = 1, \ldots, k, \]

such that

\[ \lim_{m \to \infty} \frac{\tau_{j,m}}{m} = \infty, \quad j = 1, \ldots, k. \tag{3} \]

Now define policy $\pi^0$ which in period $n$ selects any population $j$
with probability equal to

\[
\pi^0_n(j \mid h_n) =
\begin{cases}
1, & \text{if } \tau_{j,m} = n \text{ for some } m \ge 1, \\
\hat{x}^n_j, & \text{otherwise,}
\end{cases}
\]

where $\hat{x}^n$ is any optimal BFS of the certainty-equivalence LP
$\hat{z}_n$.

The main idea in $\pi^0$ is that at periods which coincide with the terms
of sequence $\tau_j$, population $j$ is selected regardless of the history.
These instances are referred to as forced selections of population $j$. The
purpose of forced selections is to ensure that all populations are sampled
infinitely often, so that the estimate vector $\hat{\mu}^n$ converges to the
true mean $\mu$ as $n \to \infty$.

On the other hand, because sequences $\tau_j$ are sparse, the fraction of
forced selection periods converges to zero for all $j$, so that sampling
from the nonoptimal populations does not affect the average outcome in the
long run.

In the remaining time periods, which do not coincide with a sparse se- 
quence term, the sampling policy is that suggested by the certainty equiva- 
lence LP, i.e., the experimenter in general randomizes between those popu- 
lations, which, based on the observed history, appear to be optimal. 
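The two ingredients of the policy, forced selections on sparse sequences and certainty-equivalence randomization otherwise, can be put together in a short simulation. The Python sketch below is only an illustration under several assumptions that are not taken from the paper: populations are indexed from 0, the sparse sequences are chosen as $\tau_{j,m} = k m^2 + j$, populations not yet sampled receive the estimate 0, the certainty-equivalence LP is solved by enumerating its basic feasible solutions, and all function names are hypothetical.

```python
import random
from math import isqrt
from typing import List, Optional, Sequence, Tuple


def forced_population(n: int, k: int) -> Optional[int]:
    """Population forced at period n (1-based), or None.

    Assumed sparse sequences: tau_{j,m} = k*m^2 + j for j = 0..k-1.
    """
    q, j = divmod(n, k)
    return j if q >= 1 and isqrt(q) ** 2 == q else None


def enumerate_bfs(costs: Sequence[float], c0: float) -> List[List[float]]:
    """All BFS of LP (1): single populations i <= d and pairs i <= d < j."""
    k = len(costs)
    low = [i for i in range(k) if costs[i] <= c0]
    high = [j for j in range(k) if costs[j] > c0]
    solutions = []
    for i in low:
        single = [0.0] * k
        single[i] = 1.0
        solutions.append(single)
        for j in high:
            if costs[i] < c0:  # the pair spends the budget exactly
                pair = [0.0] * k
                pair[j] = (c0 - costs[i]) / (costs[j] - costs[i])
                pair[i] = 1.0 - pair[j]
                solutions.append(pair)
    return solutions


def simulate(p: Sequence[float], trials: int, costs: Sequence[float],
             c0: float, horizon: int, seed: int = 0) -> Tuple[float, float]:
    """Run the policy once with Binomial(trials, p[j]) outcomes.

    Returns (average outcome, average cost) over `horizon` periods.
    """
    rng = random.Random(seed)
    k = len(costs)
    bfs = enumerate_bfs(costs, c0)
    counts = [0] * k
    sums = [0.0] * k
    total_reward = 0.0
    total_cost = 0.0
    for n in range(1, horizon + 1):
        j = forced_population(n, k)
        if j is None:
            # Sample means; unsampled populations get 0.0 (assumption).
            mu_hat = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(k)]
            # Optimal BFS of the certainty-equivalence LP.
            x = max(bfs, key=lambda s: sum(m * w for m, w in zip(mu_hat, s)))
            j = rng.choices(range(k), weights=x)[0]
        outcome = sum(rng.random() < p[j] for _ in range(trials))
        counts[j] += 1
        sums[j] += outcome
        total_reward += outcome
        total_cost += costs[j]
    return total_reward / horizon, total_cost / horizon


avg_reward, avg_cost = simulate([0.3, 0.5, 0.9, 0.8], 5, [3, 4, 8, 10], 5, horizon=20000)
print(f"average outcome {avg_reward:.3f}, average cost {avg_cost:.3f}")
```

The example uses the parameters of the simulation study in Section 4, for which $z^*(\mu) = 3$ and $C_0 = 5$. A single run of this kind only indicates the behavior of one sample path; it is not a substitute for the replicated experiments reported there.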

In the next theorem we prove the main result of the paper, namely that
$\pi^0 \in \Pi^C$. The proof adapts the main idea of Robbins (1952) to the
problem with the cost constraint.

Theorem 1 Policy $\pi^0$ is consistent.

Before we show Theorem 1, we prove an intermediate result which shows
that if in some period the certainty equivalence LP yields an optimal solution
that is non-optimal under the true distribution $f$, then the estimate of at
least one population mean must be sufficiently different from the true value.
We use the supremum norm $\|x\| = \max_j |x_j|$.

Lemma 1 For any $\mu$ there exists $\epsilon > 0$ such that for any
$n = 1, 2, \ldots$, if $b \in s(\hat{\mu}^n)$ and $b \notin s(\mu)$ for some
$b \in K$, then $\|\hat{\mu}^n - \mu\| \ge \epsilon$.

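Proof. Since $b \notin s(\mu)$, we have that for any basic matrix $B'$
corresponding to BFS $b$ there exists at least one $m \in \{1, \ldots, k\}$
such that $\phi^{B'}_m(\mu) < 0$. Therefore,

\[ -w^{B'}_m \mu = -\phi^{B'}_m(\mu) > 0. \tag{4} \]

In addition, since $b \in s(\hat{\mu}^n)$, there exists a basic matrix $B$
corresponding to $b$, such that for any $m \in \{1, \ldots, k\}$ it is true
that $\phi^B_m(\hat{\mu}^n) \ge 0$, thus,

\[ w^B_m \hat{\mu}^n = \phi^B_m(\hat{\mu}^n) \ge 0. \tag{5} \]

For this basic matrix $B$ and an index $m$ with $\phi^B_m(\mu) < 0$, it
follows from (4) and (5) that

\[
\begin{aligned}
& w^B_m (\hat{\mu}^n - \mu) \ge |\phi^B_m(\mu)| \\
\Rightarrow\ & k \, \|w^B_m\| \, \|\hat{\mu}^n - \mu\| \ge |\phi^B_m(\mu)| \\
\Rightarrow\ & \|\hat{\mu}^n - \mu\| \ge \frac{|\phi^B_m(\mu)|}{k \, \|w^B_m\|},
\end{aligned}
\]

because from the property $w^B_m \mu < 0$ it follows that $\|w^B_m\| > 0$.
Now let

\[ \epsilon = \min_{b \in K \setminus s(\mu)} \ \min_{B \in b} \ \min_{m \in \{1, \ldots, k\}} \left\{ \frac{|\phi^B_m(\mu)|}{k \, \|w^B_m\|} : \phi^B_m(\mu) < 0 \right\} > 0, \]

where the minimization over $B \in b$ is taken over all basic matrices
corresponding to BFS $b$.
Then $\|\hat{\mu}^n - \mu\| \ge \epsilon$. ■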

Proof of Theorem 1.

For $i = 1, \ldots, k$ let

\[ SS_i(n) = \sum_{t=1}^{n} 1\{\tau_{i,m} = t \text{ for some } m\} \]

denote the number of periods in $\{1, \ldots, n\}$ where a forced selection
from population $i$ is performed.
Also let

\[ Y^b_j(n) = \sum_{t=1}^{n} 1\{b \in s(\hat{\mu}^t),\ b \text{ is used in period } t, \text{ and } j \text{ is sampled from, due to randomization in } b\}, \]

\[ Y^b(n) = \sum_{j \in b} Y^b_j(n), \]

\[ Y(n) = \sum_{b \in s(\mu)} Y^b(n). \]

Since these include all possibilities of selection in a period, it is true
that

\[ n = \sum_{i=1}^{k} SS_i(n) + \sum_{b \in s(\mu)} Y^b(n) + \sum_{b \notin s(\mu)} Y^b(n). \]

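Now let $W_n$ denote the sum of outcomes in periods where true optimal
BFS are used:

\[ W_n = \sum_{b \in s(\mu)} \sum_{t=1}^{n} X_t \cdot 1\{b \text{ is used in period } t\}. \]

To show the theorem we will prove that

\[ \lim_{n \to \infty} \frac{SS_i(n)}{n} = 0, \quad i = 1, \ldots, k, \tag{6} \]

\[ \lim_{n \to \infty} \frac{Y^b(n)}{n} = 0, \quad \text{a.s.}, \ \forall b \notin s(\mu), \tag{7} \]

\[ \lim_{n \to \infty} \frac{W_n}{n} = z^*(\mu), \quad \text{a.s.} \tag{8} \]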

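First, (6) holds since $\tau_{i,m}$ are sparse for all $i$. To show (7),
note that in periods without forced selection, in order to sample from a BFS
$b$ it is necessary but not sufficient that $b \in s(\hat{\mu}^t)$, thus

\[ Y^b(n) \le \sum_{t=1}^{n} 1\{b \in s(\hat{\mu}^t)\}. \]

For any $b \in s(\hat{\mu}^t)$ and $b \notin s(\mu)$ it follows from
Lemma 1 that

\[ \|\hat{\mu}^t - \mu\| \ge \epsilon. \]

Therefore, for $b \notin s(\mu)$

\[ Y^b(n) \le \sum_{t=1}^{n} 1\{b \in s(\hat{\mu}^t)\} \le \sum_{t=1}^{n} 1\{\|\hat{\mu}^t - \mu\| \ge \epsilon\}, \]

thus,

\[ \frac{Y^b(n)}{n} \le \frac{1}{n} \sum_{t=1}^{n} 1\{\|\hat{\mu}^t - \mu\| \ge \epsilon\} \to 0, \quad \text{a.s.}, \]

because $\hat{\mu}^t \to \mu$, a.s., since $\hat{\mu}$ is a strongly
consistent estimator, thus (7) holds.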
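Now to show (8) we rewrite $W_n$ as

\[
\begin{aligned}
\frac{W_n}{n} &= \frac{1}{n} \sum_{b \in s(\mu)} \sum_{t=1}^{n} X_t \cdot 1\{b \text{ is used in period } t\} \\
&= \frac{1}{n} \sum_{b \in s(\mu)} \sum_{j \in b} \sum_{t=1}^{n} X_t \cdot 1\{b \text{ is used in period } t \text{ and } j \text{ is sampled from}\}.
\end{aligned}
\]

From this expression it follows that

\[ \frac{W_n}{n} = \sum_{b \in s(\mu)} \frac{Y^b(n)}{n} \, z^b_n, \]

where

\[ z^b_n = \frac{1}{Y^b(n)} \sum_{j \in b} \sum_{t=1}^{n} X_t \cdot 1\{b \text{ is used in period } t \text{ and } j \text{ is sampled from}\} \]

is the average outcome over the periods in which $b$ is used.

Since $Y(n) = \sum_{b \in s(\mu)} Y^b(n)$, we have

\[
\begin{aligned}
\frac{W_n}{n} - z^* &= \sum_{b \in s(\mu)} \frac{Y^b(n)}{n} \, z^b_n - z^* + \frac{Y(n)}{n} z^* - \frac{Y(n)}{n} z^* \\
&= \sum_{b \in s(\mu)} \frac{Y^b(n)}{n} \left( z^b_n - z^* \right) - \left( 1 - \frac{Y(n)}{n} \right) z^*.
\end{aligned}
\]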

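To show (8) we will prove that

\[ \frac{Y^b(n)}{n} \cdot (z^b_n - z^*) \to 0 \ \ \text{a.s.} \ \forall b \in s(\mu), \quad \text{and} \quad \frac{Y(n)}{n} \to 1, \ \ \text{a.s.} \]

Random variable $Y^b(n)$ is increasing in $n$ and $0 \le Y^b(n) \le n$, thus
either $Y^b(n) \to \infty$ or $Y^b(n) \to M$ for some $M < \infty$. We
define the following events:

\[ D = \{Y^b(n) \to \infty\} \quad \text{and} \quad D^c = \{Y^b(n) \to M\}. \]

Now let $P(D) = p$ and $P(D^c) = 1 - p$. Also let

\[ A = \left\{ \lim_{n \to \infty} \frac{Y^b(n)}{n} \cdot (z^b_n - z^*) = 0 \right\}. \]

Then $P(A) = P(A \mid D) \cdot p + P(A \mid D^c) \cdot (1 - p)$.
Now,

\[
\begin{aligned}
P(A \mid D) &= P\left( \lim_{n \to \infty} \frac{Y^b(n)}{n} \cdot (z^b_n - z^*) = 0 \ \Big|\ \lim_{n \to \infty} Y^b(n) = \infty \right) \\
&\ge P\left( \lim_{n \to \infty} z^b_n - z^* = 0 \ \Big|\ \lim_{n \to \infty} Y^b(n) = \infty \right) \\
&= 1,
\end{aligned}
\]

from the strong law of large numbers, since $\frac{Y^b(n)}{n} \le 1$
$\forall n$, and

\[ P(A \mid D^c) = P\left( \lim_{n \to \infty} \frac{Y^b(n)}{n} \cdot (z^b_n - z^*) = 0 \ \Big|\ \lim_{n \to \infty} Y^b(n) = M < \infty \right) = 1, \]

since in this case $z^b_n - z^*$ is bounded for any finite $n$.
Therefore, $P(A) = 1$, thus

\[ \frac{Y^b(n)}{n} \cdot (z^b_n - z^*) \to 0, \quad n \to \infty, \ \text{a.s.}, \ \forall b \in s(\mu). \]

Finally, from (6) and (7),

\[ \frac{Y(n)}{n} = \sum_{b \in s(\mu)} \frac{Y^b(n)}{n} = 1 - \sum_{i=1}^{k} \frac{SS_i(n)}{n} - \sum_{b \notin s(\mu)} \frac{Y^b(n)}{n} \to 1, \quad \text{a.s.} \]

Thus the proof of the theorem is complete.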
4 Rate of Convergence - Simulations



From the results of the previous section it follows that there exists
significant flexibility in the construction of a consistent sampling policy.
Indeed, any collection of sparse sequences of forced selection periods
satisfying (3) guarantees that Theorem 1 holds.

In this section we refine the notion of consistency and examine how the 
rate of convergence of the average outcome to the optimal value is affected 
by different types of sparse sequences. Furthermore, since the sensitivity 
analysis will be performed using simulation, it is more appropriate to use 
the expected value of the deviation as the convergence criterion. We thus 
consider the expected difference of the average outcome under a consistent
policy $\pi$ from the optimal value:

\[ d^{\pi}_n(\mu) = E\left( \frac{S^{\pi}_n}{n} \right) - z^*(\mu). \]

Note that the almost sure convergence of $S^{\pi}_n / n$ to $z^*(\mu)$
proved in Theorem 1 does not imply convergence in expectation, unless
further technical assumptions on the unknown distributions are made. For the
purpose of our simulation study, we will further assume that the outcomes of
any population are absolutely bounded with probability one, i.e.,
$P(|X_j| \le u) = 1$, for some $u > 0$. Under this assumption it is easy to
show that Theorem 1 implies

\[ \lim_{n \to \infty} d^{\pi}_n(\mu) = 0, \tag{9} \]

for any consistent policy $\pi$ and any vector $\mu$.

To explore the rate of convergence in (9), we performed a simulation
study for a problem with $k = 4$ populations. The outcomes of population
$i$ follow a binomial distribution with parameters $(N, p_i)$, where
$p_1 = 0.3$, $p_2 = 0.5$, $p_3 = 0.9$, $p_4 = 0.8$. The vector of expected
values is thus $\mu = (1.5, 2.5, 4.5, 4)$, which corresponds to $N = 5$. The
cost vector is $c = (3, 4, 8, 10)$ and $C_0 = 5$. Under this set of values
the optimal policy under complete information is $x = (0, 3/4, 1/4, 0)$,
$y = 0$ and $z^*(\mu) = 3$, i.e., it is optimal to randomize between
populations 2 and 3, the expected sampling cost per period is equal to 5 and
the expected average reward per period is equal to 3.

For the above problem we simulated the performance of a consistent
policy for sparse sequences of power function form:

\[ \{\tau_{j,m} = \ell_j + m^b,\ m = 1, 2, \ldots\}, \quad j = 1, \ldots, k, \]

where $\ell_j$ are appropriately defined constants which ensure that the
sequences are not overlapping, and the exponent parameter $b$ is common for
all populations. We compared the convergence rate in (9) for five values of
$b$: (1.2, 1.5, 2, 3, 5). For each value of $b$ the corresponding policy was
simulated for 1000 scenarios of length $n = 10^4$ periods each, to obtain an
estimate of the expected average outcome per period and of the deviation
$d^{\pi}_n(\mu)$. The results of the simulations are presented in Figure 1.

Figure 1: Comparison of Convergence Rates for Power Sparse Sequences
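To get a feeling for how the exponent $b$ controls the frequency of forced selections, the following Python sketch counts forced periods over a horizon of $10^4$. It is an illustration under assumptions that differ from the exact form above: since $m^b$ is not an integer for $b = 1.2$ or $b = 1.5$, the sketch uses $\tau_{j,m} = k \lceil m^b \rceil + j$ with 0-based populations, which keeps the sequences integer-valued and nonoverlapping. The function name `power_schedule` is hypothetical.

```python
from math import ceil
from typing import Dict


def power_schedule(k: int, b: float, horizon: int) -> Dict[int, int]:
    """Map each forced period t <= horizon to its population (0-based).

    Assumed form: tau_{j,m} = k * ceil(m^b) + j, so sequences of different
    populations lie in different residue classes mod k and never overlap.
    """
    schedule: Dict[int, int] = {}
    for j in range(k):
        m = 1
        while True:
            t = k * ceil(m ** b) + j
            if t > horizon:
                break
            schedule[t] = j
            m += 1
    return schedule


for b in (1.2, 1.5, 2, 3, 5):
    forced = len(power_schedule(4, b, 10_000))
    print(f"b = {b}: {forced} forced periods out of 10000")
```

Under this construction the number of forced periods decreases as $b$ grows, which is the mechanism behind the tradeoff between estimation accuracy and sampling from non-optimal populations.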

We observe in Figure 1 that the convergence is slower both for small and
large values of $b$ and faster for intermediate values. Especially for
$b = 1.2$ the difference is relatively large even after 10000 periods. This
is explained as follows. For small values of $b$ the forced selections are
more frequent. Although this has the desirable effect that the mean
estimates for all populations become accurate very soon, it also means that
non-optimal populations are also sampled frequently because of forced
selections. As a result the average outcome may deviate from the true
optimal value for a longer time period. On the other hand, for large values
of $b$ the sequences $\tau_j$ all become very sparse and thus the forced
selections are rare. In this case it takes a longer time for the estimates
to converge, and the linear programming problems may produce non-optimal
solutions for long intervals.

It follows from the above discussion that intermediate values of $b$ are
generally preferable, since they offer a better balance of the two effects,
fast estimation of all mean values and avoiding non-optimal populations.
This is also evident in the graph, where the value $b = 2$ seems to be the
best in terms of speed of convergence.

To address the question of accuracy of the comparison of convergence
rates based on simulation, Figure 2 presents a 95% confidence region for
the average outcome curve corresponding to $b = 2$, based on 1000 simulated
scenarios. The confidence region is generally very narrow (note that the 
vertical axes have different scale in the two figures), thus the estimate of 
the expected average outcome is quite accurate. This is also the case for 
the other curves, therefore the comparison of convergence rates is valid. 
Furthermore, the length of the confidence interval becomes smaller for larger 
time periods since, as expected, the convergence to the true value is better 
for longer scenario durations. 

Another issue arising from Figure 1 is the following. For $b = 1.2$ the
average outcome converges very slowly to $z^*$, but remains above it for the
entire scenario duration. Thus it could be argued that, although the
convergence is not good, this policy is actually preferable, because it
yields higher
average outcomes than the other policies. It also seems to contradict the fact 
that z* is the maximum average outcome under complete information, since 
there is a sampling policy that even under incomplete information performs 
better. 

The reason for this discrepancy is related to the form of the cost
constraint (2). The constraint requires the infinite-horizon expected cost
per period not to exceed $C_0$. This does not preclude the possibility that
one or more populations with large sampling costs and large expected
outcomes could be used for arbitrarily long intervals before switching to a
constrained-optimal policy for the remaining infinite horizon. Such policies
might achieve average rewards higher than $z^*$ for long intervals; however,
this is achieved by "borrowing", i.e., violating the cost constraint, also
for long time periods. Since (2) is only required to hold in the limit, this
behavior of a policy is allowed.

Although the consistent policies in Section 3 are not designed specifically
to take advantage of this observation, neither are they designed to avoid
it. Therefore, it is possible, as it happens here for $b = 2$, that a
consistent policy may achieve higher than optimal average outcomes for long
time periods before it converges to $z^*$.

Figure 2: Confidence Region for Average Outcome for b = 2 (vertical axis: Average Reward)

The above discussion shows that the constraint as expressed in (2) may
not be appropriate if, for example, the sampling cost is a tangible amount
that must be paid each time an observation is taken, and there is a budget
$C_0$ per period for sampling. In this situation a policy may suggest
exceeding the budget for long time periods and still be feasible, something
that may not be viable in reality. In such cases it would be more realistic
to impose a stricter average cost constraint, for example to require that
the average cost bound in (2) hold for all $n$ and not only in the limit.
5 Conclusion and Extensions



In this paper we developed a family of consistent adaptive policies for se- 
quentially sampling from k independent populations with unknown distri- 
butions under an asymptotic average cost constraint. The main idea in the 
development of this class of policies is to employ a sparse sequence of forced 
selection periods for each population, to ensure consistent estimation of all 
unknown means and in the remaining time periods employ the solution ob- 
tained from a linear programming problem that uses the estimates instead 
of the true values. We also performed a simulation study to compare the 
convergence rate for different policies in this class. 

This work can be extended in several directions. First, as it was shown 
in Section 4, the asymptotic form of the cost constraint is in some sense 
weak, since it allows the average sampling cost to exceed the upper bound 
for arbitrarily long time periods and still be satisfied in the limit. A more 
appropriate, albeit more complex, model would be to require the cost con- 
straint to be satisfied at all time points. The construction of consistent 
and, more importantly, efficient policies under this stricter version of the 
constraint is work currently in progress. 

Another extension is towards the direction of Markov process control.
Instead of assuming distinct independent populations with i.i.d.
observations, one might consider an average reward Markovian Decision
Process with unknown transition law and/or reward distributions, and one or
more nonasymptotic side constraints on the average cost. In this case the
problem is to construct consistent and, more importantly, efficient control
policies, extending the results of Burnetas and Katehakis (1997) to the
constrained case.